Kash’s Portfolio - Mastering Standard Deviation: A Comprehensive Guide with Python Implementation

Why Standard Deviation

We often receive millions of data points that we must transform before we can build a model for our business use case. Data scientists are often said to spend about 80% of their time cleaning and transforming data and only about 20% discovering insights and developing models. This transformation work relies on a basic mathematical concept that we all learned in high school: standard deviation.

Before jumping into the transformation part, we need to understand the data statistically: its summary statistics, such as mean and variance, and the distribution it follows, such as normal, uniform, or Poisson. Let’s understand standard deviation (S.D.) through a practical application.

Understanding a use case for SD

Imagine we have a massive dataset with millions of data points, and some of these data points stand out because they have extremely high or low values. Now, the problem is that these unusual data points can mess up our graphs by making the scales look strange. What makes it tricky is that the number of these unusual data points can change. Sometimes, we might have only regular data, while other times, these strange data points can make up as much as 10% of our data.

So, the solution here is to get rid of these odd data points before we make our graphs. We can do this by ignoring any values above (Mean + 2SD) or below (Mean - 2SD) before we start plotting our data.
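The filtering rule above can be sketched with NumPy. This is a minimal illustration rather than a definitive implementation: the function name `drop_outliers`, the parameter `k`, and the sample data are all made up for the example.

```python
import numpy as np


def drop_outliers(values, k=2.0):
    """Keep only the values within k standard deviations of the mean."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        return arr  # nothing to filter
    center = arr.mean()
    spread = arr.std()  # population SD (NumPy's default, ddof=0)
    mask = np.abs(arr - center) <= k * spread
    return arr[mask]


# Hypothetical data: nine regular values and one extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 500]
filtered = drop_outliers(data)
print(filtered)
```

Here the single value of 500 falls outside Mean ± 2SD and is dropped, while the nine regular values are kept.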

Implementing in Python

To compute the S.D., we first need the mean of the numbers. So, let’s start with a mean function and then build the standard deviation function on top of it:

import math

import numpy as np

def mean(list_of_numbers):
    # Arithmetic mean: sum of the values divided by their count
    return sum(list_of_numbers) / len(list_of_numbers)

def std_dev(list_of_numbers):
    # Population standard deviation (divides by N); NaN for empty input
    if len(list_of_numbers) == 0:
        return np.nan
    avg = mean(list_of_numbers)
    variance = sum((i - avg) ** 2 for i in list_of_numbers) / len(list_of_numbers)
    return math.sqrt(variance)

Now that we’ve defined our functions, let’s create a NumPy array of 10 random integers between 1 and 29:

list_of_numbers = np.random.randint(1,30,10)

print(list_of_numbers)

[ 5 27 12  5  2 20 10  8 29 23]

print(f'Standard Deviation for the list of numbers provided = {std_dev(list_of_numbers):0.2f}')

Standard Deviation for the list of numbers provided = 9.34

Using the statistics module

import statistics

list_ = [int(item) for item in list_of_numbers]
print(f'Standard Deviation using the statistics module for the list of numbers provided = {statistics.stdev(list_):0.2f}')

Standard Deviation using the statistics module for the list of numbers provided = 9.85

Note: Without the int(item) conversion, this call raised AttributeError: 'numpy.int64' object has no attribute 'bit_length'. bit_length() is a built-in method of regular Python int objects, but NumPy integer types do not provide it.
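As an alternative to the list comprehension, NumPy arrays have a `tolist()` method that returns built-in Python numbers. The sketch below assumes an array holding the same values as the list printed earlier; the names `arr` and `plain_ints` are only for illustration.

```python
import statistics

import numpy as np

# Same values as the array printed earlier
arr = np.array([5, 27, 12, 5, 2, 20, 10, 8, 29, 23])

# ndarray.tolist() converts NumPy integers to regular Python ints
plain_ints = arr.tolist()

print(type(plain_ints[0]).__name__)
print(f"{statistics.stdev(plain_ints):0.2f}")
```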

Using the pandas module

One important thing to mention here is that the pandas std() method is called on a pandas Series or DataFrame. Hence, we need to convert our NumPy array into a pandas Series.

import pandas as pd

series_of_numbers_pd = pd.Series(list_of_numbers)

print(f'Standard Deviation using the pandas for the Series of numbers provided = {series_of_numbers_pd.std() :0.2f}')

Standard Deviation using the pandas for the Series of numbers provided = 9.85

Using the numpy module

print(f'Standard Deviation using numpy for the array of numbers provided = {np.std(list_of_numbers) :0.2f}')

Standard Deviation using numpy for the array of numbers provided = 9.34

Using PyTorch

One important thing to mention here is that PyTorch functions require a PyTorch tensor as input. Hence, we need to convert our array into a torch tensor. Another thing worth mentioning is that torch.std requires a floating-point tensor. Hence, we’ll convert all items in our array to float using a list comprehension.

import torch

tensors_of_numbers = torch.tensor([float(x) for x in list_of_numbers])
tensors_of_numbers

tensor([ 5., 27., 12.,  5.,  2., 20., 10.,  8., 29., 23.])

print(f'Standard Deviation using torch for the tensor of numbers provided = {torch.std(tensors_of_numbers) :0.2f}')

Standard Deviation using torch for the tensor of numbers provided = 9.85

Notice any discrepancies?

As you might have noticed, our pure-Python function and the NumPy function gave a result of 9.34, whereas the statistics, pandas, and PyTorch built-in functions gave 9.85. The reason for the difference is the denominator of the standard deviation formula. By default, statistics.stdev, pandas, and torch.std treat the data as a sample and compute the sample standard deviation, while our function and np.std compute the population standard deviation. The sample standard deviation has N-1 as the denominator in its formula, while the population SD formula has N.

Population Standard Deviation \[\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}\]

Sample Standard Deviation \[s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\]
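
As a quick check, plugging the ten numbers printed earlier into both formulas reproduces the two results. Their mean is 14.1, and the squared deviations from it sum to 872.9, so:

\[\sigma = \sqrt{\frac{872.9}{10}} \approx 9.34, \qquad s = \sqrt{\frac{872.9}{9}} \approx 9.85\]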

Since we don’t normally know the population mean, we have to use the sample mean to calculate the variance, and this introduces bias. We estimate the population variance (\(\sigma^2\)) using the sample variance (\(s^2\)). Let the sample mean be \(\bar{x}\) and the population mean be \(\mu\). Taken together, the data points \(x_1, x_2, \dots, x_n\) are closer to \(\bar{x}\) than to \(\mu\), because \(\bar{x}\) is the value that minimizes the sum of squared deviations of the sample.

This makes \(\sum (x_i - \bar{x})^2\) no larger than \(\sum (x_i - \mu)^2\) (the numerator of the formula). To compensate for this shortfall, the formula divides by \(n-1\) instead of just \(n\) when estimating the variance.
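
A small simulation can make this bias visible. The sketch below uses only the standard library. The sample size, number of trials, seed, and variable names are arbitrary choices for illustration. It repeatedly draws small samples from a normal distribution with variance 1 and averages the two variance estimates.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 5           # size of each sample
trials = 10000  # number of samples to draw

biased = []     # variance estimates that divide by n
unbiased = []   # variance estimates that divide by n - 1
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    biased.append(statistics.pvariance(sample))
    unbiased.append(statistics.variance(sample))

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)
print(f"divide by n:     {mean_biased:0.3f}")
print(f"divide by n - 1: {mean_unbiased:0.3f}")
```

With samples of size 5, the divide-by-n average should land near 0.8, which is (n - 1)/n times the true variance, while the divide-by-(n - 1) average should land near 1.0.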

This is based on the concept of degrees of freedom (df): the number of independent pieces of information available for estimating the population standard deviation from the sample. Because the deviations from the sample mean must sum to zero, only n - 1 of them are free to vary, so df is “n - 1” for a sample of size “n.”
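
Most libraries let you choose the denominator explicitly. The sketch below, using the same ten numbers as earlier, shows both variants with the statistics module and NumPy, where `ddof` stands for "delta degrees of freedom". pandas accepts the same `ddof` argument on `Series.std()` (default 1), and recent PyTorch versions appear to expose a similar `correction` argument on `torch.std`, though that is worth confirming against the version you use.

```python
import statistics

import numpy as np

values = [5, 27, 12, 5, 2, 20, 10, 8, 29, 23]  # the numbers printed earlier

# Population SD: divide by n
pop_stdlib = statistics.pstdev(values)
pop_numpy = np.std(values)  # ddof=0 is NumPy's default

# Sample SD: divide by n - 1
sample_stdlib = statistics.stdev(values)
sample_numpy = np.std(values, ddof=1)

print(f"population: {pop_stdlib:0.2f} {pop_numpy:0.2f}")
print(f"sample:     {sample_stdlib:0.2f} {sample_numpy:0.2f}")
```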